In [1]:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='...', project_access_token='...')

Cleaning NOAA Weather Data of JFK Airport (New York)

This notebook relates to the NOAA Weather Dataset - JFK Airport (New York). The dataset contains 114,546 hourly observations of 12 local climatological variables (such as temperature and wind speed) collected at JFK airport. This dataset can be obtained for free from the IBM Developer Data Asset Exchange.

In this notebook, we clean the raw dataset by:

  • removing redundant columns and preserving only key numeric columns
  • converting and cleaning data where required
  • creating a fixed time interval between observations (this aids with later time-series analysis)
  • filling missing values
  • encoding certain weather features

Table of Contents:

0. Prerequisites

Before you run this notebook complete the following steps:

  • Insert a project token
  • Import required modules

Insert a project token

When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:

# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context

If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:

  • Click on More -> Insert project token in the top-right menu section

ws-project.mov

  • This should insert a cell at the top of this notebook similar to the example given above.

    If an error is displayed indicating that no project token is defined, follow these instructions.

  • Run the newly inserted cell before proceeding with the notebook execution below

Import required modules

Import and configure the required modules.

In [2]:
# Define required imports
import pandas as pd
import numpy as np
import sys
import re
# These set pandas max column and row display in the notebook
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

1. Read the Raw Data

We start by reading in the raw dataset, displaying the first few rows of the dataframe, and taking a look at the columns and column types present.

In [3]:
# define filename
DATA_PATH = 'jfk_weather.csv'

def get_file_handle(fname):
    # Project data path for the raw data file
    data_path = project.get_file(fname)
    data_path.seek(0)
    return data_path

# Using pandas to read the data 
# Since the `DATE` column consists date-time information, we use Pandas parse_dates keyword for easier data processing
data_path = get_file_handle(DATA_PATH)
raw_data = pd.read_csv(data_path, parse_dates=['DATE'])
raw_data.head()
/opt/conda/envs/Python36/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3020: DtypeWarning: Columns (8,10,11,12,13,14,15,16,17,20,22,23,25,40,63,64,67,68,69,84,85,86,87) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Out[3]:
STATION STATION_NAME ELEVATION LATITUDE LONGITUDE DATE REPORTTPYE HOURLYSKYCONDITIONS HOURLYVISIBILITY HOURLYPRSENTWEATHERTYPE HOURLYDRYBULBTEMPF HOURLYDRYBULBTEMPC HOURLYWETBULBTEMPF HOURLYWETBULBTEMPC HOURLYDewPointTempF HOURLYDewPointTempC HOURLYRelativeHumidity HOURLYWindSpeed HOURLYWindDirection HOURLYWindGustSpeed HOURLYStationPressure HOURLYPressureTendency HOURLYPressureChange HOURLYSeaLevelPressure HOURLYPrecip ... MonthlyGreatestPrecip MonthlyGreatestPrecipDate MonthlyGreatestSnowfall MonthlyGreatestSnowfallDate MonthlyGreatestSnowDepth MonthlyGreatestSnowDepthDate MonthlyDaysWithGT90Temp MonthlyDaysWithLT32Temp MonthlyDaysWithGT32Temp MonthlyDaysWithLT0Temp MonthlyDaysWithGT001Precip MonthlyDaysWithGT010Precip MonthlyDaysWithGT1Snow MonthlyMaxSeaLevelPressureValue MonthlyMaxSeaLevelPressureDate MonthlyMaxSeaLevelPressureTime MonthlyMinSeaLevelPressureValue MonthlyMinSeaLevelPressureDate MonthlyMinSeaLevelPressureTime MonthlyTotalHeatingDegreeDays MonthlyTotalCoolingDegreeDays MonthlyDeptFromNormalHeatingDD MonthlyDeptFromNormalCoolingDD MonthlyTotalSeasonToDateHeatingDD MonthlyTotalSeasonToDateCoolingDD
0 WBAN:94789 JFK INTERNATIONAL AIRPORT NY US 3.4 40.6386 -73.7622 2010-01-01 00:51:00 FM-15 FEW:02 7 SCT:04 13 BKN:07 29 6.00 -RA:02 PL:06 BR:1 |RA:61 PL:74 |RA:61 PL:79 33 0.6 32 0.1 31 -0.6 92 0 000 NaN 29.97 8.0 NaN 29.99 0.01 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -9999 -9999 NaN -9999 -9999 NaN NaN NaN NaN NaN NaN
1 WBAN:94789 JFK INTERNATIONAL AIRPORT NY US 3.4 40.6386 -73.7622 2010-01-01 01:00:00 FM-12 NaN 5.59 ||PL:79 33 0.6 32 0.1 31 -0.6 92 0 000 NaN 29.96 8.0 +0.05 29.99 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -9999 -9999 NaN -9999 -9999 NaN NaN NaN NaN NaN NaN
2 WBAN:94789 JFK INTERNATIONAL AIRPORT NY US 3.4 40.6386 -73.7622 2010-01-01 01:51:00 FM-15 FEW:02 6 SCT:04 13 OVC:08 35 6.00 -RA:02 PL:06 BR:1 |RA:61 PL:74 |RA:61 PL:79 33 0.6 33 0.3 32 0.0 96 0 000 NaN 29.97 NaN NaN 29.99 0.02 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -9999 -9999 NaN -9999 -9999 NaN NaN NaN NaN NaN NaN
3 WBAN:94789 JFK INTERNATIONAL AIRPORT NY US 3.4 40.6386 -73.7622 2010-01-01 02:03:00 FM-16 FEW:02 6 SCT:04 13 OVC:08 35 6.00 -RA:02 BR:1 |RA:61 |RA:61 34 1 33 0.7 32 0.0 93 0 000 NaN 29.97 NaN NaN NaN T ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -9999 -9999 NaN -9999 -9999 NaN NaN NaN NaN NaN NaN
4 WBAN:94789 JFK INTERNATIONAL AIRPORT NY US 3.4 40.6386 -73.7622 2010-01-01 02:28:00 FM-16 BKN:07 7 BKN:07 15 OVC:08 35 5.00 -RA:02 BR:1 |RA:61 |RA:61 34 1 33 0.7 32 0.0 93 0 000 NaN 29.97 NaN NaN NaN T ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN -9999 -9999 NaN -9999 -9999 NaN NaN NaN NaN NaN NaN

5 rows × 90 columns

In [4]:
raw_data.dtypes
Out[4]:
STATION                                      object
STATION_NAME                                 object
ELEVATION                                   float64
LATITUDE                                    float64
LONGITUDE                                   float64
DATE                                 datetime64[ns]
REPORTTPYE                                   object
HOURLYSKYCONDITIONS                          object
HOURLYVISIBILITY                             object
HOURLYPRSENTWEATHERTYPE                      object
HOURLYDRYBULBTEMPF                           object
HOURLYDRYBULBTEMPC                           object
HOURLYWETBULBTEMPF                           object
HOURLYWETBULBTEMPC                           object
HOURLYDewPointTempF                          object
HOURLYDewPointTempC                          object
HOURLYRelativeHumidity                       object
HOURLYWindSpeed                              object
HOURLYWindDirection                          object
HOURLYWindGustSpeed                         float64
HOURLYStationPressure                        object
HOURLYPressureTendency                      float64
HOURLYPressureChange                         object
HOURLYSeaLevelPressure                       object
HOURLYPrecip                                 object
                                          ...      
MonthlyGreatestPrecip                       float64
MonthlyGreatestPrecipDate                   float64
MonthlyGreatestSnowfall                      object
MonthlyGreatestSnowfallDate                  object
MonthlyGreatestSnowDepth                     object
MonthlyGreatestSnowDepthDate                float64
MonthlyDaysWithGT90Temp                     float64
MonthlyDaysWithLT32Temp                     float64
MonthlyDaysWithGT32Temp                     float64
MonthlyDaysWithLT0Temp                      float64
MonthlyDaysWithGT001Precip                  float64
MonthlyDaysWithGT010Precip                  float64
MonthlyDaysWithGT1Snow                      float64
MonthlyMaxSeaLevelPressureValue             float64
MonthlyMaxSeaLevelPressureDate                int64
MonthlyMaxSeaLevelPressureTime                int64
MonthlyMinSeaLevelPressureValue             float64
MonthlyMinSeaLevelPressureDate                int64
MonthlyMinSeaLevelPressureTime                int64
MonthlyTotalHeatingDegreeDays                object
MonthlyTotalCoolingDegreeDays                object
MonthlyDeptFromNormalHeatingDD               object
MonthlyDeptFromNormalCoolingDD               object
MonthlyTotalSeasonToDateHeatingDD           float64
MonthlyTotalSeasonToDateCoolingDD           float64
Length: 90, dtype: object

2. Clean the Data

As you can see above, there are a lot of fields which are non-numerical - usually these will be fields that contain text or categorical data, e.g. HOURLYSKYCONDITIONS.

There are also fields - such as the main temperature field of interest HOURLYDRYBULBTEMPF - that we expect to be numerical, but are instead object type. This often indicates that there may be missing (or null) values, or some other unusual readings that we may have to deal with (since otherwise the field would have been fully parsed as a numerical data type).

In addition, some fields relate to hourly observations, while others relate to daily or monthly intervals. For purposes of later exploratory data analysis, we will restrict the dataset to a certain subset of numerical fields that relate to hourly observations.

In this section, we refer to the NOAA Local Climatological Data Documentation to describe the fields and meaning of various values.

2.1 Select data columns

First, we select only the subset of data columns of interest and inspect the column types.

In [5]:
# Choose what columns to import from raw data
column_subset = [
    'DATE',
    'HOURLYVISIBILITY',
    'HOURLYDRYBULBTEMPF',
    'HOURLYWETBULBTEMPF',
    'HOURLYDewPointTempF',
    'HOURLYRelativeHumidity',
    'HOURLYWindSpeed',
    'HOURLYWindDirection',
    'HOURLYStationPressure',
    'HOURLYPressureTendency',
    'HOURLYSeaLevelPressure',
    'HOURLYPrecip',
    'HOURLYAltimeterSetting'
]

# Filter dataset to relevant columns
hourly_data = raw_data[column_subset]
# Set date index
hourly_data = hourly_data.set_index(pd.DatetimeIndex(hourly_data['DATE']))
hourly_data.drop(['DATE'], axis=1, inplace=True)
hourly_data.replace(to_replace='*', value=np.nan, inplace=True)
hourly_data.head()
Out[5]:
HOURLYVISIBILITY HOURLYDRYBULBTEMPF HOURLYWETBULBTEMPF HOURLYDewPointTempF HOURLYRelativeHumidity HOURLYWindSpeed HOURLYWindDirection HOURLYStationPressure HOURLYPressureTendency HOURLYSeaLevelPressure HOURLYPrecip HOURLYAltimeterSetting
DATE
2010-01-01 00:51:00 6.00 33 32 31 92 0 000 29.97 8.0 29.99 0.01 29.99
2010-01-01 01:00:00 5.59 33 32 31 92 0 000 29.96 8.0 29.99 NaN NaN
2010-01-01 01:51:00 6.00 33 33 32 96 0 000 29.97 NaN 29.99 0.02 29.99
2010-01-01 02:03:00 6.00 34 33 32 93 0 000 29.97 NaN NaN T 29.99
2010-01-01 02:28:00 5.00 34 33 32 93 0 000 29.97 NaN NaN T 29.99
In [6]:
hourly_data.dtypes
Out[6]:
HOURLYVISIBILITY           object
HOURLYDRYBULBTEMPF         object
HOURLYWETBULBTEMPF         object
HOURLYDewPointTempF        object
HOURLYRelativeHumidity     object
HOURLYWindSpeed            object
HOURLYWindDirection        object
HOURLYStationPressure      object
HOURLYPressureTendency    float64
HOURLYSeaLevelPressure     object
HOURLYPrecip               object
HOURLYAltimeterSetting     object
dtype: object

2.2 Clean up precipitation column

From the dataframe preview above, we can see that the column HOURLYPrecip - which is the hourly measure of precipitation levels - contains both NaN and T values. T specifies trace amounts of precipitation, while NaN means not a number, and is used to denote missing values.

We can also inspect the unique values present for the field.

In [7]:
hourly_data['HOURLYPrecip'].unique()
Out[7]:
array(['0.01', nan, '0.02', 'T', '0.00', '0.10', '0.07', '0.03', '0.05',
       '0.15', '0.06', '0.08', '0.04', '0.09', '0.11', '0.20', '0.13',
       '0.14', '0.05s', '0.12', '0.24', '0.02s', '0.17', '0.16', '0.47',
       '0.28', '0.32', '0.19', '0.22', '0.18', '0.26', '0.23', '0.30',
       '0.33', '0.04s', '0.36', '0.40', '0.41', '0.51', '0.50', '0.68s',
       '0.21', '0.35', '0.54', '0.01s', '0.34', '0.94', '1.14', '1.18',
       '1.15', '0.65', '0.74', '0.39', '0.67', '0.29', '0.46', '0.87',
       '1.10', '0.52', '0.08s', '0.37', '0.42', '0.03s', '0.09s', '0.11s',
       '0.10s', '0.25', '0.33s', '0.13s', '0.38', '0.64s', '0.06s',
       '0.17s', '0.69s', '0.73', '2.41', '0.57', '0.84', '0.86', '0.27',
       '0.60', '0.45', '0.75', '1.26', '0.59', '0.68', '0.79', '0.70',
       '0.44', '0.43', '0.53', '0.37s', '0.77', '0.85', '0.93', '0.55',
       '0.56', '1.52', '1.05', '1.76', '0.07s', '0.63', '1.19', '0.12s',
       '0.18s', '0.58', '0.48', '0.66', '0.29s', '0.71', '0.20s', '0.31',
       '0.020.01s', '0.62', '0.49', '0.69', '0.64', '0.80', '1.54',
       '0.50s', '0.24s', '0.61', '0.98', '1.78', '1.67', '0.26s', '0.19s',
       '0.48s', '0.36s', '0.28s', '1.45', '0.92', '0.95', '0.16s',
       '0.15s', '0.58s', '1.07', '1.37', '0.14s', '0.25s', '0.40s',
       '1.16'], dtype=object)

We can see that some values end with an s (indicating snow), while there is a strange value 0.020.01s which appears to be an error of some sort. To deal with T values, we will set the observation to be 0. We will also replace the erroneous value 0.020.01s with NaN.

In [8]:
# Fix imported data
hourly_data['HOURLYPrecip'].replace(to_replace='T', value='0.00', inplace=True)
hourly_data['HOURLYPrecip'].replace('0.020.01s', np.nan, inplace=True)

2.3 Convert columns to numerical types

Next, we will convert string columns that refer to numerical values to numerical types. For columns such as HOURLYPrecip, we first also drop the non-numerical parts of the value (the s character).

In [9]:
# Set of columns to convert
messy_columns = column_subset[1:]

# Convert columns to float32 datatype
for i in messy_columns:
    hourly_data[i] = hourly_data[i].apply(lambda x: re.sub('[^0-9,.-]', '', x) if type(x) == str else x).replace('', np.nan).astype(('float32'))

We can now see that all fields have numerical data type.

In [10]:
print(hourly_data.info())
print()
hourly_data.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 114545 entries, 2010-01-01 00:51:00 to 2018-07-27 23:59:00
Data columns (total 12 columns):
HOURLYVISIBILITY          96464 non-null float32
HOURLYDRYBULBTEMPF        111403 non-null float32
HOURLYWETBULBTEMPF        111245 non-null float32
HOURLYDewPointTempF       111395 non-null float32
HOURLYRelativeHumidity    111395 non-null float32
HOURLYWindSpeed           111358 non-null float32
HOURLYWindDirection       110444 non-null float32
HOURLYStationPressure     111256 non-null float32
HOURLYPressureTendency    48038 non-null float32
HOURLYSeaLevelPressure    99522 non-null float32
HOURLYPrecip              79762 non-null float32
HOURLYAltimeterSetting    86878 non-null float32
dtypes: float32(12)
memory usage: 6.1 MB
None

Out[10]:
HOURLYVISIBILITY HOURLYDRYBULBTEMPF HOURLYWETBULBTEMPF HOURLYDewPointTempF HOURLYRelativeHumidity HOURLYWindSpeed HOURLYWindDirection HOURLYStationPressure HOURLYPressureTendency HOURLYSeaLevelPressure HOURLYPrecip HOURLYAltimeterSetting
DATE
2010-01-01 00:51:00 6.00 33.0 32.0 31.0 92.0 0.0 0.0 29.969999 8.0 29.99 0.01 29.99
2010-01-01 01:00:00 5.59 33.0 32.0 31.0 92.0 0.0 0.0 29.959999 8.0 29.99 NaN NaN
2010-01-01 01:51:00 6.00 33.0 33.0 32.0 96.0 0.0 0.0 29.969999 NaN 29.99 0.02 29.99
2010-01-01 02:03:00 6.00 34.0 33.0 32.0 93.0 0.0 0.0 29.969999 NaN NaN 0.00 29.99
2010-01-01 02:28:00 5.00 34.0 33.0 32.0 93.0 0.0 0.0 29.969999 NaN NaN 0.00 29.99

2.4 Reformat and process data

Next, we will clean up some of the data columns to ensure their values fall within the parameters defined by the NOAA documentation (referred to above).

In [11]:
# Generate the summary statistics for each column
hourly_data.describe()
Out[11]:
HOURLYVISIBILITY HOURLYDRYBULBTEMPF HOURLYWETBULBTEMPF HOURLYDewPointTempF HOURLYRelativeHumidity HOURLYWindSpeed HOURLYWindDirection HOURLYStationPressure HOURLYPressureTendency HOURLYSeaLevelPressure HOURLYPrecip HOURLYAltimeterSetting
count 96464.000000 111403.000000 111245.000000 111395.000000 111395.000000 111358.000000 110444.000000 111256.000000 48038.000000 99522.000000 79762.000000 86878.000000
mean 8.758680 55.310612 49.703098 43.309196 67.132141 11.268045 192.511597 29.990612 4.388963 30.026096 0.007032 30.009375
std 2.743144 17.169971 16.061544 19.356941 20.278189 6.175224 107.608009 0.238463 2.741891 0.233694 0.042610 0.239828
min 0.000000 1.000000 -1.000000 -22.000000 8.000000 0.000000 0.000000 28.490000 0.000000 28.540001 0.000000 28.510000
25% 9.940000 42.000000 37.000000 29.000000 51.000000 7.000000 110.000000 29.850000 2.000000 29.879999 0.000000 29.870001
50% 10.000000 56.000000 50.000000 45.000000 69.000000 10.000000 200.000000 29.990000 5.000000 30.020000 0.000000 30.010000
75% 10.000000 70.000000 64.000000 60.000000 85.000000 15.000000 280.000000 30.139999 7.000000 30.170000 0.000000 30.160000
max 99.419998 102.000000 85.000000 84.000000 100.000000 53.000000 360.000000 30.830000 8.000000 30.850000 2.410000 30.850000

According to the documentation, the HOURLYPressureTendency field should be an integer value in the range [0, 8]. Let's check if this condition holds for this dataset.

In [12]:
# Check if categorical variable HOURLYPressureTendency ever has a non-integer entry outside the bounds of 0-8
cond = len(hourly_data[~hourly_data['HOURLYPressureTendency'].isin(list(range(0,9)) + [np.nan])])
print('Hourly Pressure Tendency should be between 0 and 8: {}'.format(cond == 0))
Hourly Pressure Tendency should be between 0 and 8: True

The HOURLYVISIBILITY should be an integer in the range [0, 10]. Let's check this condition too.

In [13]:
# Hourly Visibility should be between 0 and 10
hourly_data[(hourly_data['HOURLYVISIBILITY'] < 0) | (hourly_data['HOURLYVISIBILITY'] > 10)]
Out[13]:
HOURLYVISIBILITY HOURLYDRYBULBTEMPF HOURLYWETBULBTEMPF HOURLYDewPointTempF HOURLYRelativeHumidity HOURLYWindSpeed HOURLYWindDirection HOURLYStationPressure HOURLYPressureTendency HOURLYSeaLevelPressure HOURLYPrecip HOURLYAltimeterSetting
DATE
2011-10-16 11:51:00 14.000000 68.0 53.0 37.0 33.0 16.0 250.0 29.85 NaN NaN NaN 29.870001
2015-06-21 17:45:00 99.419998 79.0 72.0 68.0 70.0 37.0 310.0 29.74 NaN NaN 0.0 29.760000

We find that a couple of observations fall outside the range. These must be spurious data observations and we handle them by replacing them with NaN.

In [14]:
# Replace any hourly visibility figure outside these bounds with nan
hourly_data.loc[hourly_data['HOURLYVISIBILITY'] > 10, 'HOURLYVISIBILITY'] = np.nan

# Hourly Visibility should be between 0 and 10
cond = len(hourly_data[(hourly_data['HOURLYVISIBILITY'] < 0) | (hourly_data['HOURLYVISIBILITY'] > 10)])
print('Hourly Visibility should be between 0 and 10: {}'.format(cond == 0))
Hourly Visibility should be between 0 and 10: True

Finally, we check if there are any duplicates with respect to our DATE index and check furthermore that our dates are in the correct order (that is, strictly increasing).

In [15]:
cond = len(hourly_data[hourly_data.index.duplicated()].sort_index())
print('Date index contains no duplicate entries: {}'.format(cond == 0))
Date index contains no duplicate entries: True
In [16]:
# Make sure time index is sorted and increasing
print('Date index is strictly increasing: {}'.format(hourly_data.index.is_monotonic_increasing))
Date index is strictly increasing: True

2.5 Create a fixed interval dataset

Most time-series analysis requires (or certainly works much better with) data that has fixed measurement intervals. As you may have noticed from the various data samples above, the measurement intervals for this dataset are not exactly hourly. So, we will use Pandas' resampling functionality to create a dataset that has exact hourly measurement intervals.

In [17]:
# Resample (downsample) to hourly rows (we're shifting everything up by 9 minutes!)
hourly_data = hourly_data.resample('60min').last().shift(periods=1) #Note: use resample('60min', base=51) to resample on the 51st of every hour

We will now also replace missing values. For numerical values, we will linearly interpolate between the previous and next valid obvservations. For the categorical HOURLYPressureTendency field, we will replace missing values with the last valid observation.

In [18]:
hourly_data['HOURLYPressureTendency'] = hourly_data['HOURLYPressureTendency'].fillna(method='ffill') # fill with last valid observation
hourly_data = hourly_data.interpolate(method='linear') # interpolate missing values
hourly_data.drop(hourly_data.index[0], inplace=True) # drop first row
In [19]:
print(hourly_data.info())
print()
hourly_data.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 75119 entries, 2010-01-01 01:00:00 to 2018-07-27 23:00:00
Freq: 60T
Data columns (total 12 columns):
HOURLYVISIBILITY          75119 non-null float32
HOURLYDRYBULBTEMPF        75119 non-null float32
HOURLYWETBULBTEMPF        75119 non-null float32
HOURLYDewPointTempF       75119 non-null float32
HOURLYRelativeHumidity    75119 non-null float32
HOURLYWindSpeed           75119 non-null float32
HOURLYWindDirection       75119 non-null float32
HOURLYStationPressure     75119 non-null float32
HOURLYPressureTendency    75119 non-null float32
HOURLYSeaLevelPressure    75119 non-null float32
HOURLYPrecip              75119 non-null float32
HOURLYAltimeterSetting    75119 non-null float32
dtypes: float32(12)
memory usage: 4.0 MB
None

Out[19]:
HOURLYVISIBILITY HOURLYDRYBULBTEMPF HOURLYWETBULBTEMPF HOURLYDewPointTempF HOURLYRelativeHumidity HOURLYWindSpeed HOURLYWindDirection HOURLYStationPressure HOURLYPressureTendency HOURLYSeaLevelPressure HOURLYPrecip HOURLYAltimeterSetting
DATE
2010-01-01 01:00:00 6.0 33.0 32.0 31.0 92.0 0.0 0.0 29.969999 8.0 29.990000 0.01 29.990000
2010-01-01 02:00:00 6.0 33.0 33.0 32.0 96.0 0.0 0.0 29.969999 8.0 29.990000 0.02 29.990000
2010-01-01 03:00:00 5.0 33.0 33.0 32.0 96.0 0.0 0.0 29.969999 8.0 29.990000 0.00 29.990000
2010-01-01 04:00:00 5.0 33.0 33.0 32.0 96.0 0.0 0.0 29.950001 8.0 29.969999 0.00 29.969999
2010-01-01 05:00:00 5.0 33.0 32.0 31.0 92.0 0.0 0.0 29.930000 8.0 29.959999 0.00 29.950001

2.6 Feature encoding

The final pre-processing step we will perform will be to handle two of our columns in a special way in order to correctly encode these features. They are:

  1. HOURLYWindDirection - wind direction
  2. HOURLYPressureTendency - an indicator of pressure changes

For HOURLYWindDirection, we encode the raw feature value as two new values, which measure the cyclical nature of wind direction - that is, we are encoding the compass-point nature of wind direction measurements.

In [20]:
# Transform HOURLYWindDirection into a cyclical variable using sin and cos transforms
hourly_data['HOURLYWindDirectionSin'] = np.sin(hourly_data['HOURLYWindDirection']*(2.*np.pi/360))
hourly_data['HOURLYWindDirectionCos'] = np.cos(hourly_data['HOURLYWindDirection']*(2.*np.pi/360))
hourly_data.drop(['HOURLYWindDirection'], axis=1, inplace=True)

For HOURLYPressureTendency, the feature value is in fact a categorical feature with three levels:

  • 0-3 indicates an increase in pressure over the previous 3 hours
  • 4 indicates no change during the previous 3 hours
  • 5-8 indicates a decrease over the previous 3 hours

Hence, we encode this feature into 3 dummy values representing these 3 potential states.

In [21]:
# Transform HOURLYPressureTendency into 3 dummy variables based on NOAA documentation
hourly_data['HOURLYPressureTendencyIncr'] = [1.0 if x in [0,1,2,3] else 0.0 for x in hourly_data['HOURLYPressureTendency']] # 0 through 3 indicates an increase in pressure over previous 3 hours
hourly_data['HOURLYPressureTendencyDecr'] = [1.0 if x in [5,6,7,8] else 0.0 for x in hourly_data['HOURLYPressureTendency']] # 5 through 8 indicates a decrease over previous 3 hours
hourly_data['HOURLYPressureTendencyConst'] = [1.0 if x == 4 else 0.0 for x in hourly_data['HOURLYPressureTendency']] # 4 indicates no change during previous 3 hours
hourly_data.drop(['HOURLYPressureTendency'], axis=1, inplace=True)
hourly_data['HOURLYPressureTendencyIncr'] = hourly_data['HOURLYPressureTendencyIncr'].astype(('float32'))
hourly_data['HOURLYPressureTendencyDecr'] = hourly_data['HOURLYPressureTendencyDecr'].astype(('float32'))
hourly_data['HOURLYPressureTendencyConst'] = hourly_data['HOURLYPressureTendencyConst'].astype(('float32'))

2.7 Rename columns

Before saving the dataset, we will rename the columns for readability.

In [22]:
hourly_data.columns
Out[22]:
Index(['HOURLYVISIBILITY', 'HOURLYDRYBULBTEMPF', 'HOURLYWETBULBTEMPF',
       'HOURLYDewPointTempF', 'HOURLYRelativeHumidity', 'HOURLYWindSpeed',
       'HOURLYStationPressure', 'HOURLYSeaLevelPressure', 'HOURLYPrecip',
       'HOURLYAltimeterSetting', 'HOURLYWindDirectionSin',
       'HOURLYWindDirectionCos', 'HOURLYPressureTendencyIncr',
       'HOURLYPressureTendencyDecr', 'HOURLYPressureTendencyConst'],
      dtype='object')
In [23]:
# define the new column names
columns_new_name = [
    'visibility',
    'dry_bulb_temp_f',
    'wet_bulb_temp_f',
    'dew_point_temp_f',
    'relative_humidity',
    'wind_speed',
    'station_pressure',
    'sea_level_pressure',
    'precip',
    'altimeter_setting',
    'wind_direction_sin',
    'wind_direction_cos',
    'pressure_tendency_incr',
    'pressure_tendency_decr',
    'pressure_tendency_const'
]

columns_name_map = {c:columns_new_name[i] for i, c in enumerate(hourly_data.columns)}

hourly_data_renamed = hourly_data.rename(columns=columns_name_map)
In [24]:
print(hourly_data_renamed.info())
print()
hourly_data_renamed.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 75119 entries, 2010-01-01 01:00:00 to 2018-07-27 23:00:00
Freq: 60T
Data columns (total 15 columns):
visibility                 75119 non-null float32
dry_bulb_temp_f            75119 non-null float32
wet_bulb_temp_f            75119 non-null float32
dew_point_temp_f           75119 non-null float32
relative_humidity          75119 non-null float32
wind_speed                 75119 non-null float32
station_pressure           75119 non-null float32
sea_level_pressure         75119 non-null float32
precip                     75119 non-null float32
altimeter_setting          75119 non-null float32
wind_direction_sin         75119 non-null float64
wind_direction_cos         75119 non-null float64
pressure_tendency_incr     75119 non-null float32
pressure_tendency_decr     75119 non-null float32
pressure_tendency_const    75119 non-null float32
dtypes: float32(13), float64(2)
memory usage: 5.4 MB
None

Out[24]:
visibility dry_bulb_temp_f wet_bulb_temp_f dew_point_temp_f relative_humidity wind_speed station_pressure sea_level_pressure precip altimeter_setting wind_direction_sin wind_direction_cos pressure_tendency_incr pressure_tendency_decr pressure_tendency_const
DATE
2010-01-01 01:00:00 6.0 33.0 32.0 31.0 92.0 0.0 29.969999 29.990000 0.01 29.990000 0.0 1.0 0.0 1.0 0.0
2010-01-01 02:00:00 6.0 33.0 33.0 32.0 96.0 0.0 29.969999 29.990000 0.02 29.990000 0.0 1.0 0.0 1.0 0.0
2010-01-01 03:00:00 5.0 33.0 33.0 32.0 96.0 0.0 29.969999 29.990000 0.00 29.990000 0.0 1.0 0.0 1.0 0.0
2010-01-01 04:00:00 5.0 33.0 33.0 32.0 96.0 0.0 29.950001 29.969999 0.00 29.969999 0.0 1.0 0.0 1.0 0.0
2010-01-01 05:00:00 5.0 33.0 32.0 31.0 92.0 0.0 29.930000 29.959999 0.00 29.950001 0.0 1.0 0.0 1.0 0.0
In [25]:
# Explore some general information about the dataset
print('# of megabytes held by dataframe: ' + str(round(sys.getsizeof(hourly_data_renamed) / 1000000,2)))
print('# of features: ' + str(hourly_data_renamed.shape[1])) 
print('# of observations: ' + str(hourly_data_renamed.shape[0]))
print('Start date: ' + str(hourly_data_renamed.index[0]))
print('End date: ' + str(hourly_data_renamed.index[-1]))
print('# of days: ' + str((hourly_data_renamed.index[-1] - hourly_data_renamed.index[0]).days))
print('# of months: ' + str(round((hourly_data_renamed.index[-1] - hourly_data_renamed.index[0]).days/30,2)))
print('# of years: ' + str(round((hourly_data_renamed.index[-1] - hourly_data_renamed.index[0]).days/365,2)))
# of megabytes held by dataframe: 5.71
# of features: 15
# of observations: 75119
Start date: 2010-01-01 01:00:00
End date: 2018-07-27 23:00:00
# of days: 3129
# of months: 104.3
# of years: 8.57

3. Save the Cleaned Data

Finally, we save the cleaned dataset as a Project asset for later re-use. You should see an output like the one below if successful:

{'file_name': 'jfk_weather_cleaned.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'jfkweatherdata-donotdelete-pr-...',
 'asset_id': '...'}

Note: In order for this step to work, your project token (see the first cell of this notebook) must have Editor role. By default this will overwrite any existing file.

In [26]:
project.save_data("jfk_weather_cleaned.csv", hourly_data_renamed.to_csv(float_format='%g'), overwrite=True)
Out[26]:
{'file_name': 'jfk_weather_cleaned.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'daxweatherproject-donotdelete-pr-a99epzqcuc5or2',
 'asset_id': 'a05ae92f-c315-41d7-8402-f1b78e85eb63'}

Next steps

  • Close this notebook.
  • Open the Part 2 - Data Analysis notebook to explore the cleaned dataset.

Authors

This notebook was created by the Center for Open-Source Data & AI Technologies.

Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License.

Love this notebook? Don't have an account yet?
Share it with your colleagues and help them discover the power of Watson Studio! Sign Up